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Abstract 

The temporal order of cancer gene mutations in tumors is essential for understanding and treating the disease. Existing 
methods are unable to infer the order of mutations that are identified at the same time in individual tumor samples, leaving 
the heterogeneity of the order unknown. Here, we show that through a complex network-based approach, which is based 
on the newly defined statistic -carcinogenesis information conductivity (CIC), the temporal order in individual samples can be 
effectively inferred. The results suggest that tumor-suppressor genes might more frequently initiate the order of mutations 
than oncogenes, and every type of cancer might have its own unique order of mutations. The initial mutations appear to be 
dedicated to acquiring the function of evading apoptosis, and some order constraints might reflect potential regularities. 
Our approach is completely data-driven without any parameter settings and can be expected to become more effective as 
more data will become available. 
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Introduction 

Cancer is a genetic disease caused by the mutation of cancer 
genes consisting of oncogenes and tumor-suppressor genes. In 
most cancer cases, multiple mutations occur in a procedure known 
as tumor progression [1,2]. To understand tumor progression, 
studies have been performed to model general regularities on the 
temporal order of mutations for a given type of cancer using both 
experimental and computational approaches [3-7] . As a canonical 
model, the order of mutations for colorectal cancer was 
reconstructed through tumor size and grade [8]. The latest 
computational models infer the typical temporal order constraints 
for certain type of cancers by simulating tumor progression as a 
stochastic process [9-11]. Despite this progress, there is still no 
well-defined method to infer the order of mutations identified at 
the same time in individual samples, although this inference is 
necessary to reveal the heterogeneity of the order of mutations in a 
cancer. Recently, as new generation sequencing becomes widely 
applied, the mutation landscapes in various cancers are being 
revealed one by one. The results have shown that the mutations in 
a cancer frequently demonstrate statistical correlations with each 
other or even cause-and-effect linkages of induction between the 
former and the latter [12-18]. However, these correlations/ 
linkages have not been fully exploited in inferring the temporal 
order of mutations. 

From an informatic perspective, this study defines a statistical 
measurement to assign value to the correlations or linkages 
mentioned above and model the mutations within a complex 
network, through which the temporal order of the mutations in 



individual samples can be inferred. We call the measurement the 
carcinogenesis information conductivity (CIC), which measures the 
reachability of transferring the information of a cancer gene 
having mutated to the transcription process of a given un-mutated 
cancer gene to induce its mutation. Statistically, the reachability 
can be estimated by the individual occurrence frequencies and the 
sequential co-occurrence frequency of the two genes' mutations in 
cancer samples. Additionally, competition among the information 
sent from multiple mutated genes to the given un-mutated gene 
should also be considered as any successful sending will cause the 
target gene to mutate, thus ending the mutation process. In this 
study, we call any two mutations found out in the same cancer 
sample co-occurrent mutations. While most genomic studies 
provide this quantity in an indirect way, here we aim at 
disentangling the sequence of occurrence of two mutational events 
from the simple co-occurrence. From these sequences of mutation 
occurrence, the sequential co-occurrence frequency can be 
calculated (Materials and Methods). Based on this idea, we have 
defined the CIC from cancer gene i to cancer gene j as: 



'ft f pI 



(i) 



where ft (//) is the occurrence frequency of the mutation of gene i (/) 
in cancers, fy is the sequential co-occurrence frequency of the 
mutation of gene i followed by the mutation of gene j, and is the 
priority of gene i compared to other mutant genes to send the 
information to gene j. We have determined that 



PLOS ONE | www.plosone.org 



1 



February 2014 | Volume 9 | Issue 2 | e89244 



Inferring the Temporal Order of Mutations 



Pij = ' X (1 + S Hfkj >fij\ s ))- I 11 this equation Sy is the set of 

cancer samples with mutant genes i andj, |S,y| is the number of 
samples in the set, and I(fkj>fy\ s ) is an indicator function that 
equals 1 if fy > fy for the mutant genes k, j and i in sample s. 
Otherwise it equals 0. Accordingly, the highest priority of one will 
be assigned if/« is larger than fy in every sample of the set, and 
the more times that fy>fi], the larger value the py. We regard 
formula (1) as a measurement of carcinogenesis information 
conductivity because the ratio (fy/fi) is an estimate of the 
maximum chance that gene i sends carcinogenesis information to 
genej and causes its mutation, the ratio (fy/fj) is an estimate of the 
maximum chance that the mutation of gene j is caused by 
carcinogenesis information received from gene i, and py is the 
priority of the communication link compared with other links to 
gene j. The value of Cy ranges from 0 to 1 . Like the definition of 
activation force, a measurement we previously proposed for 
weighting the links of complex networks [19], the definition of 
CIC follows the formula of gravity if we imagine the ratios (fy/fi) 
and (fy/fj) as masses and the priority py as distance. Statistics 
defined in this manner are likely to distribute their values in a 
power law, which is convenient for analyzing complex networks of 
intricate relationships including those in biology [20-24]. 

One challenge in computing the CICs is the lack of cancer 
samples that can be used as the source of the sequential co- 
occurrence frequencies of the cancer gene mutations because the 
mutations of different genes in a cancer sample are usually 
identified at the same time by sequencing. To tackle this challenge, 
we present an iterative procedure that couples CIC computation 
and the inference of the probability of every potential order of 
cancer gene mutation. The application of this procedure to the 
Catalogue of Somatic Mutations in Cancer (COSMIC) database 
[25,26] revealed that the iteration reached convergence within 
fewer than 1 0 loops, and the convergent results suggest significant 
conclusions. 

Materials and Methods 

Iterative inference scheme 

To perform the iterative inference procedure, a large set of 
cancer samples with cancer gene mutations identified by genome- 
wide sequencing is necessary. With the dataset, we determine the 
basic statistics of occurrence and non-sequential co-occurrence 
frequencies of cancer gene mutations. From these basic statistics, 
the iterative inference for the number of samples in question 
begins and the CIC results and probable orders of cancer gene 
mutation for each sample in question are determined when the 
iteration reaches convergence. Fig. 1 illustrates an overview of the 
procedure. 



and inference of the mutation orders until a convergent result is 
obtained. 

Based on the principle of maximum entropy, we first use a 
uniform prior distribution of the occurrence orders, which means 
that for the non-sequential co-occurrence frequency of the 
mutation of two genes i and j, the two mutation orders of i—>j 
andj-H>i occur with the same probability. Therefore, the necessary 
sequential co-occurrence frequency is set as a half of the 
corresponding non-sequential frequency. With this setting, we 
compute the initial CIC between every pair of cancer genes. 

We then compute the CIC that an order of more than two 
mutant genes possesses. In this computation, we must consider 
that each of the preceding genes may send the carcinogenesis 
information in parallel to a target gene within the order. 
Therefore, we borrow the principle of computing resistance in a 
circuit, which is a parallel-by-serial procedure; we sum all the 
parallel CICs from the preceding genes to a target gene within the 
order to determine the phase CIC of the order and then formulate 
the order CIC by cascading all the phase CICs. Consider the order 
APC-^ATM-^KRAS as an example; this order contains two phases 
of information sending, —>ATM and —>KRAS. During the first 
phase, the information can be sent from only one source, APC 
Therefore, C A pc,atm, the CIC from APC to A TM, simply becomes 
the CIC of the first phase. In the second phase, however, both APC 
and ATM can become the information source, requiring the 
summation of the two parallel CICs as the CIC of the second 
phase. After the parallel step of each phase, the reciprocals of phase 
CICs, regarded as resistances, are serially summed as the reciprocal 
of the order CIC. The steps are summarized as follows: 

Parallel step: 

First phase : ->ATM C l phase = C APC ,ATM 
Second phase : ->KRAS C 2 phase = C A pc,kras + C A tm,kras 
Series step: 

The order CIC : C order = — — 5 * 2 — 

1 1 '-phase + l / ^ phase 



The kth gene in the order is the information receiving gene at 
the (k-l)xh phase and has k-1 senders of parallel information. An 
order consisting of n genes has n-1 phases of carcinogenesis 
information conduction. In general, we have the equation, 



Iterative procedure of CIC computation and inference of 
mutation order 

By definition, sequential co-occurrence frequencies are neces- 
sary to estimate the CIC value. However, this requirement cannot 
be satisfied by the current databases, including COSMIC. To 
overcome this difficulty, we adopt an iterative procedure to couple 
the inference of the occurring mutation orders and the compu- 
tation of the CICs. First, we evenly divide a non-sequential co- 
occurrence frequency into the two possible sequential co- 
occurrence frequencies to calculate the initial CICs. We then 
infer the mutation orders with the initial CICs to repredict the 
sequential co-occurrence frequencies, repeat CIC computation 



Corder R _ j 

£ i/C 

me 

fc=l 

k 

where C k hmc = Q(p),Kk+l) is the CIC of phase k, Q^k+i) is 
p=i 

the CIC from gene i(p) to gene i(k+ 1), and i(x) is the index of the 
gene at positions in the order. 

Based on the definition of the CIC, a larger CIC value of a 
possible order implies easier carcinogenesis information conduc- 
tion within the order. Among all competing orders, the larger the 
CIC value of an order, the greater probability the occurrence of 
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Figure 1. Overview of the inference methodology, (a) The occurrence and co-occurrence frequencies of the cancer gene mutations f t and g,, 
are determined from available samples, where ij= 1,2. ..,m, and m is the number of the cancer genes targeted in the study. An occurrence of a gene 
will be counted if it is mutated in one of the samples, and a co-occurrence of a pair of genes will be counted if both are mutated in one of the 
samples; therefore, gij=gji and gv, =0. (b) Based on the principle of maximum entropy, the initial values of the sequential co-occurrence frequencies 
are set as/jj 0 ' =ji =0.5gij. (c) The carcinogenesis information conductivities, Cy, are calculated from the vector off, and the matrix of/jj. It should be 
noted that Cg might not be equal to Cp implying that the matrix of Qj represents a directed network, (d) For each of the L samples in question, the 
probabilities of every potential order of the mutant genes in sample Pi are computed according to the CICs of each order (Methods), (e) The matrix of 
fp is redetermined by the matrix of gg and the ratio of the probability-weighted number of the orders indicated that / occurs before / to the number 
of co-occurrence frequency, it is important to note that/?' is not equal to fT* in general. If the matrix of ftp has not reached the criterion of 
convergence, the inferred orders will not be regarded as stable and a new loop of the calculation of Q and Pi will be performed. Otherwise (f), the 
orders with a probability higher than random chance and the corresponding probabilities {df^} and {Pf} are regarded as the referred results. For 



example, of all 6 potential orders for a sample with three mutant cancer genes a, b and c, orders 6->a->c and Cl- 
ones due to probabilities of 0.7 and 0.2 (higher than a random chance of 1/6). 
doi:1 0.1 371 /journal.pone.0089244.g001 



*a are identified as the probable 



the order. Therefore, we presume that the CIC of an order is 
positively proportional to the probability of that order occurring. 
When estimating the probability of every potential order by a 
linear mapping from the CICs of all potential orders for a given set 
of mutant genes, the total of the probabilities of all the potential 
orders is equal to one. Formally, for a sample with n mutant cancer 
genes, the number of potential orders is «!; we map the CIC of 
order m (m = 1, 2, . .., «!) into its probability using the equation 

_ _ '"order 
P '"~ n\ 

V c 

i—i order 

After determine the probabilities of every possible order of the 
mutations, we redetermine the predicted sequential co-occurrence 
frequencies as follows: 

r _ I m 

JiJ ~ E J2pi,A^j\o'J+ E EpLW^WJ ' g,j 

l m I m 



where p m is the probability of order m of sample /, /= 1,2,. ..,L, and 
L is the number of samples in question. /(z'->/'|o' K ) is an indicator 
function that equals 1 when gene i occurs before gene j in order m 
of sample / and equals 0 in all other cases, and gy is the non- 
sequential co-occurrence frequency between gene i and gene j. If 
the redetermined fy values are nearly identical to the old ones or 
become convergent, the computed CICs and thus the inferred 
order probabilities can be regarded as reliable outcomes. 
Otherwise, the CICs and the order probabilities have to be 
redetermined in a new loop. The iterative procedure continues in 
this manner until convergence is reached. In practice, the criterion 
of convergence can be regarded as satisfied when the absolute 
difference between the new and old values of fy monotonically 
reduces to a sufficiently small value. 

Because we begin the iterative procedure with an initial 
prediction of the sequential co-occurrence frequencies from non- 
sequential frequencies based on the maximum entropy principle, 
which provides the maximum modification potential of the 
sequential co-occurrence frequencies in the first iteration, the 
modification will decrease gradually and finally become insignif- 
icant. This premise was verified in the study; a satisfying 
convergence was reached within fewer than 10 loops of the 
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Figure 2. Convergence of iterative computation of CICs. The CICs of Ckras.apc (a) and its counterpart Capckras (b) quickly reach convergent 
as the iterations of computation are performed. After 6 iterations, a satisfying convergence has been reached. 
doi:1 0.1 371 /journal.pone.0089244.g002 



inference procedure using a set of samples from the COSMIC 
database. 

The iteration based on COSMIC data reaches convergence 
within 10 loops. Here, we use the computation of CIC from KRAS 
to APC to introduce the procedure in detail. Initially, we calculate 
the occurrence frequencies oificRAS- 125 and/4/>c = 209 and a 
non-sequential co-occurrence frequency gKRAS,APC = 79 from the 
COSMIC database. By defining half of the non-sequential co- 
occurrence frequency (79) as the sequential frequency, we 
determine that /kras,apc — 39.5. When comparing with the 
sequential co-occurrence frequencies from genes other than KRAS 
to the gene APC in each of the 79 samples, /kras.apc is found to 
have an average order of 1.47. Therefore the priority 
Pkras,apc = 1-47 , and the initial value of Ckras.apc = {^-5/ 
125)*(39.5/209)/1.47 2 = 0.028. 

Using the initial CICs between all cancer gene pairs, we estimate 
the probability of every potential mutation occurrence order in each 
sample in the manner described above. According to the 
probabilities, the non-sequential co-occurrence frequencies can be 
unevenly divided into sequential frequencies. For the 79 samples in 
this example, the ratio of KRAS-^APC vs. APC-^> KRAS based on the 
corresponding total probability for each order is 0.28: 0.72. 
Therefore, we update the value of /kras.apc = 79*0.28 = 22. 1, 
and the priority Pkras.apc is then determined with the new 
/kras.apc- With these new values, we redetermine Ckras.apc- 

The convergence of Ckras.apc an d its counterpart Capc.kras 
during the iterations is shown in Fig. 2. This example demon- 
strates that the values reach a satisfying convergence after just 6 
iterations. This example also represents the common situation, 




10 1 I ' 1 

10'" 10' 5 10"" 10~ 3 10' 2 10"' 



order of magnitude of CIC 

Figure 3. The power law-like distribution of CICs. CICs greater 
than 1.0E-6 are present in a power law-like distribution; specifically, the 
logarithm of the number of the CICs versus the logarithm of their orders 
of magnitude yields a piecewise linear relationship. Based on the 
question of whether a power law distribution is appropriate for 
analyzing complex networks and worries about the unreliability of 
undervalued CICs that might be caused by spare data, only the CICs 
greater than 1.0E-6 were directly used in the inference in this study. 
CICs inferred as less than 1.0E-6 were replaced by the threshold for 
smoothing. 

doi:1 0.1 371 /journal.pone.0089244.g003 



thus we ended the computation of CICs after 10 iterations in this 
study. 

Complexity of the inference procedure 

CIC computation has a complexity of 0{n 2 ) if the number of 
cancer genes in the study is n, and the inference of the probabilities 
of all potential orders for a sample with m mutant cancer genes has 
a complexity of 0{m!m 2 ). In our study, n is equal to 397 and m 
ranges from 2 to 8. Therefore, the complexity of 0(m!m 2 ) can differ 
greatly for different samples. In reality, during the inference for the 
1,118 samples reported in the study, the majority of the time was 
consumed by a few samples with the maximum number of mutant 
cancer genes. It is worth noting that during the entire procedure, 
we only have to compute the CICs once in each loop to infer the 
order probabilities for all samples. The inference procedure with 
1 0 iterations for the 1,118 samples was completed within 1 0 
minutes on a platform consisting of a PC (4*2.66 GHz Quad 
CPU) and Madab. 

Study data 

The results reported in this study were obtained from a recent 
COSMIC database (issued on September 12 th , 2012) on coding 
point mutations. It is a table file containing the names of the 
mutated cancer genes in each cancer sampled. Mutant genes in 
the same cancer have the same tumor ID (ID_tunwur), and the 
fields of genome-wide-screen and primary side provide the necessary 
information used in this study. 

Steps for determining the occurrence and co-occurrence 
frequencies of cancer gene mutations in the samples 

The occurrence and co-occurrence frequencies of cancer genes 
in the cancer samples were used to estimate the CICs in the study, 
and the basic statistics were determined using the following steps: 

1) Download the source file CosmicMutantExport_v61_120912.tsv 
through ftp:/ /ftp.sanger.ac.uk/ pub/ CGP/cosmic/ data_export/ ; 

2) Make a temporary file by obtaining the records with the value 'y' 
in the 'genome-wide screen' field from the source file; 

3) Make a primary file by obtaining the records of cancer genes 
defined by the file Table_l _Jull_2012-03-15.xls in the Cosmic 
web site from the temporary file, and refining the records into 
sequences of Gene_name and ID_Sample; 

4) Make a mutation_sequence file in which each record is a list of 
the mutated genes in the same sample based on the primary 
file, and discard the record that contains only one gene name 
in the mutation_sequence file; 

5) Count the occurrence and co-occurrence frequencies of the 
cancer genes based on the mutation_sequence file. 

Results 

Features of the estimated CICs 

We performed the inference on cancer gene mutation data from 
genome-wide scanned samples collected in a recent version of the 
COSMIC database. A total of 1,212 samples harboring 6,281 
mutations in 397 cancer genes was available for determining the 
basic occurrence and co-occurrence frequencies. From these, 
1,118 samples, each harboring no more than 8 mutant cancer 
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Figure 4. An illustration of the asymmetry of CIC-linked cancer gene networks. Forty-four frequently mutated cancer genes (in more than 
20 genome-wide scanned samples in the COSMIC database) are illustrated with the CICs between them larger than 1 .OE-2. The thickness of the link is 
proportional to the strength of the corresponding CIC. When a pair of genes has bidirectional links, the stronger link is drawn as a straight line and 
the weaker one is drawn as a curved line (see the case of APC < — >TP53). The asymmetry can be observed by the fact that no bidirectional links of 
similar strengths exist between gene pairs, and APC, TP53, and MLL3 each play a hub role in the network. 
doi:10.1371/journal.pone.0089244.g004 



genes, were used in the iterative procedure of CIC computation 
and order inference. Table SI lists the 1,118 samples. The results 
were found to converge within 10 iterations. After convergence, 
CICs with a value greater than 1.0E-6 presented a power law-like 
distribution over the magnitudes, such that the overwhelming 
majority has a magnitude less than the average of 4.0E-4 and a 
very small portion has a larger than average magnitude (Fig. 3, 
Table S2). This feature is also true for the distribution of the 
magnitudes of the CICs from (or to) a given gene in most cases, 
which means that only a small number of partners are significant 
in terms of carcinogenesis information conduction for any given 
gene. In other words, the CICs identify the closest partners in 
carcinogenesis information conduction. Furthermore, the directed 
networks of cancer genes linked by the CICs were asymmetrical 
and small world-like. The CIC from gene i to gene j was usually 
unequal to that from genej to gene i; the network has a number of 
hub genes with many more links than normal. This feature is 
consistent with the notion that the signaling network in cancer is 
analogous to the Internet, which constructs a small world with hub 
nodes [27-29]. Fig. 4 illustrates a CIC linked network covering 44 
cancer genes, including the hub genes APC, TP53 and A1LL3, and 
the links stronger than 1 .OE-2 showing asymmetry. The asymme- 
try of the CICs implies the existence of a preference for certain 
mutation orders. Additionally, the three hub genes are all tumor- 
suppressor genes, and the strongest directed link, with a value of 
0.136, is from APC to KRAS, one of the most frequently mutated 
oncogenes, suggesting a superior information channel from the 
mutation of APC to the mutation of KRAS. 

The inference of probable orders 

The inferred mutation orders with a probability higher than 
random chance, referred to hereafter as probable orders, provided 
more concrete insights into tumor progression. We analyzed the 
probable orders inferred for the 1,118 cancer samples in question 
to investigate a maximum of 8 mutation steps from initiation. The 
primary sites of the samples were mainly located in the ovary (256), 



large_intestine (LI, 180), haematopoietic _and_lymphoid_tis sue (HLT, 148), 
prostate (100), breast (97), central_nervous_system (CMS, 86), and 
upper _aerodigestive_tract (UAT, 72). 

Table S3 lists all the probable orders and their probabilities in 
the analyzed samples, and Table 1 shows a selection of them. 
Based on the probable orders, we concluded that in a given sample 
only a small portion of all the potential orders has a probability 
higher than random chance, and the sum total of the probabilities 
of those orders is close to the number of samples with a ratio of 
1034.4/1118. This indicates that the inference identified a small 
portion of all the potential orders permutated by the given set of 
mutant cancer genes as the probable orders. For a sample 
harboring two mutant cancer genes, the inference always strongly 
suggests one of the two potential orders. However, for the samples 
with more than two mutant cancer genes, some orders might have 
comparable high probabilities. Although we cannot judge the 
individual plausibilities of the inferred probable orders because of 
the lack of ground truth for the orders in most cases, their 
significance could be strongly suggested by evaluating the 
inference with samples of a certain cancer type that have been 
well studied in terms of order. For example, APC, KRAS and TP53 
are the three most frequently mutated genes in colon cancers, and 
their mutation orders have been well modeled [30,31]. In our 
results, the sample with mutant cancer genes APC and KRAS, 
yielded an inferred probability of 0.95 for the order APC-^KRAS, 
which was consistent with previous studies. For the sample with 
mutant APC, KRAS and TP53 cancer genes, three probable orders 
of APC-^KRAS-^TP53 (0.33), APC-^TP53^KRAS (0.32) and 
TP53^>APC-^>KRAS (0.19) were inferred from 6 potential ones, 
and this result was also consistent with previous studies. BRCA1 
germline mutations confer a high risk of breast and ovarian 
cancer, but somatic loss of the wild-type BRCA1 allele has been 
shown to usually occur after mutation of TP53 [32] . In agreement 
with this observation, we inferred the somatic mutation order 
TP53^BRCA1 with a probability greater than 0.99. These 
examples provide evidence to support the inference validity. 
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Table 1. Examples 


of cancer gene mutation orders with a predicted probability greater than 


random chance. 




Order 


Probability 


Order 


Probability 


APC-^BRCA2 


0.99932 


APC^ATM^KRAS 


0.39011 


APC-^FBXW7 


0.99675 


APC^KRAS^ATM 


0.41748 


APC-^KRAS 


0.95281 


APC-^BRAF^TP53 


0.20884 


APC-^NOTCH2 


0.99906 


APC-^TP53^BRAF 


0.34198 


APC-^PTEN 


0.99986 


TP53^APC^BRAF 


0.2958 


APC-^SMARCA4 


0.99934 


APC^KRAS^TP53 


0.33317 


APC-^TP53 


0.65624 


APC^TP53^KRAS 


0.32532 


ARIDIA-^ATM 


0.99991 


TP53^APC-^KRAS 


0.18769 


ARIDIA^CTNNBI 


0.99987 


APC^MLL3^PTEN 


0.49936 


ARIDIA-^PTEN 


0.99966 


APC^PTEN^MLL3 


0.49797 


ARID2^CTNNB1 


0.99992 


CREBBP—>KRAS-^ARID 1A 


0.36621 


BRAF^FBXW7 


0.99994 


CREBBP—>ARID 1A-^KRAS 


0.56109 


BRAF—>PTEN 


0.99792 


CREBBP—>CTNNB 1^SMARCA4 


0.27984 


CREBBP^BRCA2 


0.9995 


CREBBP—>SMARCA4-^CTNNB1 


0.71849 


CREBBP^CTNNBI 


0.99906 


MLL3^SMARCA4-^EP300 


0.49603 


CTNNB1^SMARCA4 


0.7542 


MLL3^EP300^SMARCA4 


0.4967 


EZH2^CARD11 


0.99922 


PIK3CA^SMARCA4^CTNNB 1 


0.46455 


EZH2^CTNNB1 


0.96595 


PIK3CA^CTNNB 1^SMARCA4 


0.52713 


KRAS^FBXW7 


0.75719 


TP53^MLL3-^ARID1A 


0.48265 


KRAS—>PTEN 


0.99874 


TP53^ARID1A-^MLL3 


0.45573 


MLL2^CTNNB1 


0.99989 


TP53^MLL3-^ARID2 


0.4851 


MYH1 1^SMARCA4 


0.99606 


TP53^ARID2^MLL3 


0.46217 


N0TCH2^AR!D1A 


0.99964 


TP53^>CTNNB I— »/ATM 


0.49888 


NRAS^SMARCA4 


0.95881 


TP53^ATM-^CTNNB1 


0.50012 


PIK3CA^ARID1A 


0.99971 


TP53^ROS1^BRAF 


0.73047 


PIK3CA-^EP300 


0.99954 


TP53^BRAF^ROS1 


0.26871 


TP53-^ARID1A 


0.99985 


TP53^PIK3CA-^BRCA2 


0.49745 


TP53^AKAP9 


0.99941 


TP53^BRCA2^PIK3CA 


0.49757 


TP53^ARID2 


0.99985 


TP53^EZH2^CREBBP 


0.47725 


TP53^ATM 


0.99974 


TP53^CREBBP^EZH2 


0.52201 


TP53—>BRAF 


0.99937 


TP53^MLL2^CREBBP 


0.56795 


TP53^BRCA2 


0.99827 


TP53^CREBBP^MLL2 


0.43152 


TP53^CARD11 


0.99926 


TP53^NOTCH2^CREBBP 


0.50977 


TP53^CREBBP 


0.99989 


TP53^CREBBP-^NOTCH2 


0.4884 


TP53^CTNNB1 


0.99984 


TP53^FBXW7^CTNNB 1 


0.47758 


TP53^EP300 


0.99894 


TP53^CTNNB 1^FBXW7 


0.47344 


TP53^EZH2 


0.99985 


TP53^KRAS^CTNNB1 


0.42094 


TP53^KRAS 


0.83612 


TP53^CTNNB1^KRAS 


0.4181 


TP53^MLL 


0.99992 


TP53^MLL2^EZH2 


0.5875 


TP53^MLL2 


0.9999 


TP53^EZH2^MLL2 


0.41196 


TP53^MLL3 


0.98377 


TP53^KRAS^FBXW7 


0.38181 


TP53^MYH11 


0.99954 


TP53^FBXW7^KRAS 


0.31798 


TP53^NRAS 


0.94242 


TP53^PIK3CA-^KRAS 


0.40396 


TP53^PIK3CA 


0.99789 


TP53^KRAS^PIK3CA 


0.36917 


TP53^PTEN 


0.99989 


TP53^PIK3CA^MYH11 


0.49797 


TP53^SMARCA4 


0.99988 


TP53^MYH1 1^PIK3CA 


0.49427 



The examples in the table are selected based on that the genes have a mutant frequency greater than 40 in the COSMIC database to show the estimation for the 
common cases. Due to the limited space, the prediction for the samples with more than 3 mutant cancer genes is not shown. Refer to Table S3 for a complete result. 
The random chance is Mnl, where n is the number of mutant cancer genes in the sample. 
Note, any two samples with the same set of mutant genes have identical predicted results. 
doi:1 0.1 371 /journal.pone.0089244.t001 
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Table 2. The type of a cancer gene and its probability-weighted times of starting an order with a probability greater than random 
chance. 



Gene 


Type 


Times 


Rate 


Gene 


Type 


Times 


Rate 


TP53 


s 


484.732 


0.844 


XP01 


o 


; 


0.077 


APC 


s 


117.471 


0.675 


TIF1 


0 




0.125 


PIK3CA 


o 


31.97 


0.381 


ERCC5 


s 




0.083 


MLL3 


5 


26.927 


0.481 


ELN 


0 




0.125 


KRAS 


O 


24.093 


0.223 


LRIG3 


o 




0.167 


MLL2 


5 


20.235 


0.355 


LM01 


0 


; 


0.5 


CREBBP 


s/o 


17.294 


0.402 


ASPSCR1 


o 




1 


ATM 


5 


14.404 


0.4 


NC0A1 


0 




0.143 


ARID1A 


S 


13.625 


0.296 


SRSF2 


o 




0.5 


NPM1 


5 


12.96 


0.48 


CBL 


s/o 




0.2 


EZH2 


S 


10.733 


0.335 


MYD88 


o 




0.125 


ac 


5 


10.649 


0.41 


NUP98 


0 




0.01 


ARID2 


S 


8.699 


0.335 


PTPN11 


o 




0.2 


R0S1 


0 


7.888 


0.316 


CDH1 


s 


0.999 


0.062 


TET2 


s 


7.049 


0.441 


PERI 


o 


0.999 


0.125 


WT1 


s 


7.027 


0.502 


MEN1 


s 


0.999 


0.1 


CTNNB1 


o 


6.862 


0.114 


PRDM1 


s 


0.999 


0.125 


PBRM1 


s 


6.854 


0.685 


1121 R 


0 


0.998 


0.25 


PTEN 


s 


6.801 


0.235 


MUC1 


o 


0.997 


0.332 


BRAE 


0 


5.988 


0.23 


PIK3R1 


s 


0.997 


0.066 


N0TCH2 


o 


5.907 


0.281 


TR1P1 1 


o 


0.996 


0.1 


MYST4 


0 


5.492 


0.25 


ITK 


0 


0.995 


0.059 


SMARCA4 


s 


5.45 


0.156 


NC0A2 


0 


0.995 


0.199 


DNMT3A 


5 


5.293 


0.23 


BCOR 


5 


0.995 


0.059 


ASXL1 


s 


5.064 


0.362 


TSHR 


o 


0.995 


0.124 


MYH9 


0 


4.988 


0.416 


MAP2K4 


5 


0.994 


0.076 


KDM5A 


o 


4.987 


0.623 


GNAS 


o 


0.994 


0.062 


NSD1 


0 


4.965 


0.414 


MSH2 


5 


0.993 


0.199 


MYH11 


o 


4.561 


0.198 = = = 


SRGAP3 


O 


0.993 


0.165 


BCL2 


0 


4.165 


0.116 


FGFR2 


0 


0.992 


0.142 


RET 


o 


3.991 


0.333 


FANCD2 


s 


0.991 


0.33 


EP300 


s 


3.991 


0.174 


MET 


0 


0.991 


0.099 


ALK 


o 


3.972 


0.306 


EWSR1 


o 


0.991 


0.142 


PHF6 


s 


3.264 


0.297 


BRIP1 


s 


0.99 


0.165 


BLM 


s 


3.076 


0.22 


SETD2 


s 


0.99 


0.062 


IDH2 


0 


3 


0.214 


BCL6 


0 


0.989 


0.11 


C0L1A1 


o 


2.997 


0.231 


WHSC1 


o 


0.985 


0.123 


GNA11 


0 


2.996 


0.749 


BAP1 


s 


0.983 


0.089 


SF3B1 


o 


2.993 


0.136 


ETV6 


o 


0.953 


0.136 


TPR 


0 


2.987 


0.373 


FLU 


0 


0.902 


0.18 


CDH11 


o 


2.987 


0.187 


DNM2 


s 


0.89 


0.064 


PDGFRA 


0 


2.971 


0.186 


BCR 


0 


0.888 


0.296 


MLL 


o 


2.847 


0.142 


VHL 


s 


0.818 


0.055 


KIT 


0 


2.59 


0.173 


FANCC 


5 


0.71 1 


0.71 1 


MED12 


o 


2.514 


0.168 


MAF 


o 


0.682 


0.341 


EBXW7 


s 


2.428 


0.069 


RUNX1 


0 


0.65 


0.043 


RBI 


s 


2.204 


0.092 


CARD11 


o 


0.5 


0.019 


LIFR 


0 


2.175 


0.128 


LHFP 


0 


0.5 


0.167 
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Table 2. Cont. 





Gene 


Type 


Times 


Rate 


Gene 


Type 


Times 


Rate 


AKAP9 


o 


2. 756 


0.108 


NUMA1 


o 


0.453 


0.032 


NRAS 


0 


2.037 


0.06 


SUZ12 


0 


0.397 


0.066 


EGFR 


o 


2 


0.154 


KIAA1549 


o 


0.319 


0.106 


JAK2 


0 


2 


0.2 


NC0A4 


0 


0.314 


0.157 


PALB2 


s 


1.991 


0.153 


FHIT 


o 


0.257 


0.086 


NF1 


s 


1.989 


0.071 


TNFAIP3 


s 


0.254 


0.023 


JAK1 


o 


1.986 


0.397 


NFKB2 


o 


0.22 


0.11 


PRDM16 


0 


1.97 


0.246 


MDM4 


0 


0.198 


0.066 


CAMTA 1 


o 


1.703 


0.122 


H3F3A 


o 


0.197 


0.197 


NOTCH 1 


0 


1.681 


0.06 


HI PI 


0 


0.196 


0.028 


CCND2 


o 


1.628 




SUFU 


s 


0.188 


0.047 


SMARCB1 


5 


1.601 


0.543 


SBDS 


s 


0.184 


0.092 


NF2 


s 


1.5 


0.133 


JAK3 


o 


0.144 


0.013 


FOXP1 


0 


1.5 


0.3 


H00K3 


0 


0.11 


0.026 


CD74 


o 


1.499 


0.15 


TCF7L2 


o 


0.094 


0.006 


CCND3 


0 


1.404 


0.75 


EIF4A2 


0 


0.069 


0.007 


PIM1 


o 


1.402 


0.14 


ERBB2 


o 


0.06 


0.003 


SLC45A3 


0 


1.384 


0.175 


ZNF331 


0 


0.052 


0.013 


IL7R 


o 


1.27 


0.154 


C15orf55 


o 


0.046 


0.011 


BTG1 


0 


1.174 


0.159 


H0XA11 


0 


0.045 


0.015 


ZNF521 


o 


1.148 


0.196 


PH0X2B 


s 


0.018 


0.006 


PDE4DIP 


0 


1.099 


0.057 


PCM1 


0 


0.008 


0.002 


MSH6 


s 


1.017 


0.061 


MALT1 


o 


0.003 


0.001 


LM02 


0 


1 


0.17 


Total 




1034.4 





s: tumor-suppressor gene, o: oncogene. The ratios of the summed frequencies of s, o, and s/o to the total are 0.775, 0.207 and 0.018, respectively. 
Rate is the ratio of Times to the gene's total mutation number in the samples in question. 
doi:1 0.1 371 /journal.pone.0089244.t002 



Initiators of probable mutation orders 

Identifying the initiators of mutation orders has been regarded 
as one of the major challenges in the study of tumor progression 
[1]. Our inferred probable orders of mutation provided informa- 
tive hints to solving this challenge. By examining the genes that 
initiate the probable orders, we found that the initiators were 
dominated by tumor-suppressor genes. An overwhelming majority 
(more than 77.5%) of the probability-weighted number of the 
probable orders was inferred to be initiated by a tumor-suppressor 
gene rather than an oncogene. There were 368 cancer genes in the 
test cancer samples, among them only 92 were tumor suppressors. 
More specifically, there were 1,858 mutations of tumor suppres- 
sors among totally 3,823 mutations of all the cancer genes. 
Therefore the average chance for tumor suppressors to initiate the 
mutation orders was 48.6% (1858/3823). This demonstrates that 
the dominance of tumor suppressors in initiating the mutation 
orders could not be ascribed to chance. Additionally, the ratios of 
the number of times a gene was the initiator to its mutation 
frequency were generally different, implying that it is not certain 
that frequently mutated genes will mutate early (Table 2). 
Significantly, the probability-weighted number of the probable 
orders started by the top two tumor-suppressor gene initiators 
TP53 and APC, consisted of percentages as large as 46.9% and 
11.4%, respectively. In contrast, the top two oncogene initiators, 
PIK3CA and KRAS, were found in percentages as small as 3.1% 
and 1.3%, respectively. The top initiators of mutation at the 



respective primary cancer sites suggested more details (Table 3). In 
general, all cancers at the major primary sites of the samples 
revealed a tumor-suppressor gene as their top initiator. In 
particular, TP53 was a common top initiator in four of the 
previously listed cancer types, ovary, UAT, breast and prostate, with 
percentages of 91.5%, 73.4%, 57.6% and 30.4%, respectively. In 
LI cancers, the top initiator was APC (57.5%), followed by TP53 
(29.7%). Both CMS and HLT cancers had no obviously superior 
initiators, with CIC (13.6%), PIK3CA (10.1%) and TP53 (10.0%) as 
the top three initiators for the former, and TP53 (14.9%), MPM1 
(10.4%) and MLL2 (9.9%) as the top three initiators for the latter. 
From the perspective of initiator distribution, ovary, LI, UAT and 
breast cancers were inferred to be dominated by a small number of 
tumor-suppressor genes, while HLT, CMS and prostate cancers were 
inferred to have more diverse significant initiators. 

Previous studies have suggested a number of hallmark functions 
that need to be acquired for a cancer to generate, helping 
researchers understand the complexity in tumor progression in a 
way of logical, scientific manner [33,34]. Our inferred results point 
to a suggestion that goes one step further. In most cancers, the 
earliest acquired hallmark function might be evading apoptosis 
because the majority of first mutated genes in every cancer type in 
Table 3 (TP53, APC, KRAS, PIK3CA, MPM1 and CIC] have been 
found to encode apoptosis-regulating proteins, and the mutation of 
all of these genes has been shown to lead to deficient apoptosis 
functions. Specifically, the mutation of TP53 can result in the 
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Table 3. The top initiators of the probable mutation orders in different cancer types. 



Cancer 


Gene 


Type 


Percent 


Cancer 


Gene 


Type 


Percent 


Ovary 


TP53 


5 


97.5 


Prostate 


NC0A1 


o 


/./ 


Ovar^ 


A PC 




1.7 


Prost3tc 


Others 




18.9 


Ovar^ 


PIK3CA 




1.5 


PfOStdtG 


Total 




59.0 


Ovar^ 


ARID! A 




0.8 


Proststc 


Total 




39.9 


Ovar^ 


KRAS 




0.7 


Breast 


TP53 




57.6 


Ovar^ 


NF2 




0.4 


Breast 


PIK3CA 




1 1.8 


Ovar^ 


BRAF 




0.4 


Breast 


MLL3 




4.7 


ollry 


RB1 




0.4 


Breast 


PTEN 


5 


2.2 


Ova^ 


MYST4 


o 


0.4 


Breast 


ATM 


5 


2.1 


Ova^ 


GNA11 


o 


0.4 


Breast 


R0S1 


o 


2. 1 


Ova^ 


EGFR 




0.4 


Breast 


AKAP9 






Ova^ 


LIFR 




0.4 


Breast 


PRDM16 






Ova^ 


MYH9 




0.4 


Breast 


SMARCA4 






Ova^ 


KIT 




0.4 


Breast 


MYH1 1 






Ova^ 


MLL3 




0.3 


Breast 


MYH9 






Ova^ 


Total 




95. 1 


Breast 


ASPSCR 1 






Ova^ 


Total 


o 


4.9 


Breast 


ARID 1 A 


5 




L| 


ARC 




57.5 


Breast 


PDGFRA 






L| 


TP53 


5 


29.7 


Breast 


BRAF 






L| 


KRAS 




6.8 


Breast 


NSD1 






L| 


ATM 




1.2 


Breast 


PDE4DIP 






L| 


PIK3CA 




0.9 


Breast 


MAP2K4 






L| 


NSD1 




0.6 


Breast 


FANCD2 






L| 


IL21R 




0.6 


Breast 


MET 






L| 


GNA11 




0.6 


Breast 


Others 




4. 1 


L| 


ALK 


o 


0.6 


Breast 


Total 


5 


75.2 


L| 


PDGFRA 


o 


0.6 


Breast 


Total 


Q 


24.7 


L| 


ARID! A 




0.5 


CNS 


C/C 




13.6 


L| 


MYH1 1 




0.3 


CNS 


PIK3CA 




10. 1 


L| 


TCF7L2 




0. 1 


CNS 


TP53 




10 


L| 


Total 




88.9 


CNS 


MLL3 




5.7 


L| 


Total 




1 jo 


CNS 


MLL2 




5.5 


HLT 


TP53 




14.9 


CNS 


CTNNB 1 




5.2 


HLT 


NPM1 




10.4 


CNS 


CREBBP 


s/o 


5. 1 


HLT 


MLL2 


5 


9.9 


CNS 


ATM 




5.1 


HLT 


EZH2 


5 


7.6 


CNS 


SMARCA4 




4.4 


HLT 


WT1 




5.6 


CNS 


NOTCH 2 




3.7 


HLT 


TET2 




4.8 


CNS 


APC 




2.5 


HLT 


DNMT3A 




4.2 


CNS 


KDM5A 




2.5 


HLT 


BCL2 




3.3 


CNS 


R0S1 




2.5 


HLT 


PHF6 




2.6 


CNS 


AKAP9 




1.4 


HLT 


IDH2 




2.4 


CNS 


COL 1A 1 




1.3 


HLT 


RET 


0 


2.4 


CNS 


TIF1 


o 


1.3 


HLT 


ASXL1 


5 


2.4 


CNS 


ARID2 


s 


1.3 


HLT 


CREBBP 


s/o 


2.1 


CNS 


TPR 


o 


1.3 


HLT 


NRAS 


0 


1.6 


CNS 


EP300 


s 


1.3 


HLT 


ATM 


s 


1.6 


CNS 


BLM 


s 


1.3 


HLT 


CDH11 


0 


1.6 


CNS 


Others 




14.9 


HLT 


KRAS 


0 


1.6 


CNS 


Total 


s 


56.6 
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Table 3. Cont. 



Cancer 


Gene 


Type 


Percent 


Cancer 


Gene 


Type 


Percent 


HLT 


CD74 


0 


1.2 


CNS 


Total 


0 


38.3 


HLT 


CCND3 


0 


1.1 


UAT 


TP53 


s 


73.4 


HLT 


PIM1 


0 


1.1 


UAT 


NSD1 


0 


3 


HLT 


Others 




17.6 


UAT 


NOTCH1 


o 


2.4 


HLT 


Total 


s 


67.5 


UAT 


FBXW7 


s 


1.9 


HLT 


Total 


0 


29.6 


UAT 


MLL3 


s 


1.9 


Prostate 


TP53 


s 


30.4 


UAT 


CREBBP 


s/o 


1.5 


Prostate 


APC 


s 


9.9 


UAT 


EP300 


s 


1.5 


Prostate 


MLL3 


s 


4.5 


UAT 


FOX PI 


0 


1.5 


Prostate 


PTEN 


s 


4.3 


UAT 


MED12 


o 


1.5 


Prostate 


MYST4 


0 


4 


UAT 


GNAS 


0 


1.5 


Prostate 


ATM 


s 


3.9 


UAT 


JAK1 


o 


1.5 


Prostate 


KDM5A 


0 


3.3 


UAT 


ASXL1 


s 


1.4 


Prostate 


PIK3CA 


0 


3.3 


UAT 


MLL2 


s 


1.1 


Prostate 


MYH9 


0 


2.2 


UAT 


FANCC 


s 


1.1 


Prostate 


TPR 


0 


2.2 


UAT 


CAMTA1 


o 


1.1 


Prostate 


MLL 


0 


2 


UAT 


APC 


s 


1 


Prostate 


MED12 


0 


1.7 


UAT 


CCND2 


o 


1 


Prostate 


SLC45A3 


0 


1.5 


UAT 


NUMA1 


0 


0.5 


Prostate 


RBI 


s 


1.3 


UAT 


KIAA1549 


o 


0.5 


Prostate 


CREBBP 


s/o 




UAT 


EZH2 


s 


0.4 


Prostate 


PALB2 


s 




UAT 


Others 




0.3 


Prostate 


ZNF521 


0 




UAT 


Total 


s 


83.9 


Prostate 


ECFR 


0 




UAT 


Total 


o 


14.6 


Prostate 


LRIG3 


0 













Percent: the sum of the probabilities of the orders initiated by the gene versus the total of the probabilities of the orders in the same cancer type. 
At most top 20 initiators are listed for each cancer type due to the limited space. 
doi:1 0.1 371 /journal.pone.0089244.t003 



removal of a key component of the DNA damage sensor, which 
functions to induce apoptosis [33,34], mutant forms of the APC 
protein can attenuate responses to apoptotic stimuli [35,36], the 
mutations in KRAS and PIK3CA can activate pathways that 
transmit anti-apoptotic survival signals [33], and the proteins 
encoded by JVPM1 and CIC have been shown to function in 
apoptosis [37,38]. 

Informative transitions in the probable orders 

The transitions in the probable orders provided additional 
information on tumor progression. Though mutations in BRCA1 
and BRCA2 have been regarded as key markers for breast cancer 
occurrence, somatic mutations in the two genes in the breast 
cancer samples were not very frequent, with rates of 3/97 and 6/ 
97, respectively, and both genes were inferred to have no chance 
of initiating a probable order. However, among all transitions in 
the probable orders of the breast cancers, TP53^BRCA2 and 
TP53^>BRCA1 were identified as the second and fourth most 
frequent transitions, respectively, implying that mutations in these 
two genes tend to occur next to the mutation of TP53. Similarly, 
the transition of TP53^>BRCAlwas ranked as the third most 
frequent in the probable orders in ovarian cancer, supporting the 
conjecture mentioned above. In LI cancers, mutations in APC, 
TP53 and KRAS were found to occur at extraordinarily frequencies 
with rates of 146/180, 111/180 and 79/180, respectively, and 



their mutual transitions were the top six most frequent, implying 
that these three genes as a group play dominating roles in LI 
cancers. Liquid HLT cancers were inferred to have the 3 most 
frequent transitions that converged on one gene, TP53^BCL2, 
MLL2—>BCL2 and EZH2-^BCL2. Given that BCL2 is a key anti- 
apoptotic gene [39] and was the most frequently mutated gene in 
HLT cancer samples, these convergent transitions suggest that 
HLT cancers might acquire the function of evading apoptosis in a 
unique way, mutation of the key anti-apoptotic gene BCL2 next to 
the mutations of certain tumor-suppressor genes. Informatively, 
among all the 36 BCL2 mutant samples, mutations at 179C and 
392C of CDS (Coding DNA Sequence) were as frequent as 5 and 
4 times, respectively, suggesting those to be hotspot mutations that 
might play a particular role in evading apoptosis. Because TP53, 
AILL2 and E^H2 were inferred to be the top initiators of mutation 
in HLT cancer samples, the function of evading apoptosis could be 
acquired in an early stage of tumor progression. 

Discussion 

The inferred results from individual samples firmly revealed the 
order heterogeneity in a given cancer type, showing the 
complexity of the disease. The results also highlighted the limited 
number of genes that are able to initiate the mutations and 
revealed that the hallmark function of evading apoptosis is 
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acquired early. Other regularities implied in the results might also 
be significant in understanding and treating the disease. 

The proposed approach for inferring the temporal order of 
mutations is superior to existing methods in two ways: 1) it can be 
used to infer the order of mutation in individual samples with 
mutations in multiple genes which have been identified at the 
same time; and 2) it is completely data-driven, free from the 
difficulty in existing methods of setting proper parameters, such as 
fitness, mutation rate or waiting time [9-11]. When this approach is 
better supported by more sufficient data, it is expected to help 
discover more reliable information to understand the mechanism 
of carcinogenesis. Fortunately, the wide application of a new 
generation of sequencing will make this hope a reality. 

The key to the success of this study is finding the statistical 
measurement of CIC, which proved to usually be asymmetrical 
between a pair of cancer genes, laying the foundation of mutation 
order inference. Meanwhile, the iterative procedure provides a 
feasible way to infer the CICs from non-sequential co-occurrence 
frequencies. With the CICs, the linkages between cancer gene 
mutations are modeled as a complex network with directed links. 
The small world-like nature of the complex network makes the 
inference of the temporal order of mutations effective. 
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